Skip to content

Conversation

smithto1
Copy link
Member

@smithto1 smithto1 commented Jul 9, 2020

Behavioural Changes
Fixing two related bugs: when grouping on multiple categoricals, .sum() and .count() would return NaN for the missing categories, but they are expected to return 0 for the missing categories. Both these bugs are fixed.

Tests
Tests were added in PR #35022 when these bugs were discovered and the tests were marked with an xfail. For this PR the xfails are removed and the tests are passing normally. As well, a few other existing tests were expecting sum() to return NaN; these have been updated so that the tests now expect to get 0 (which is the desired behaviour).

Pivot
The change to .sum() also impacts the df.pivot_table() if it is called with aggfunc=sum and is pivoted on a Categorical column with observed=False. This is not explicitly mentioned in either of the bugs, but it does make the behaviour consistent (i.e. the sum of a missing category is zero, not NaN). One test on test_pivot.py was updated to reflect this change.

Default Behaviour
Because df.groupby() and df.pivot_table() have observed=False as the default, the default behaviour will change for a user calling df.groupby().sum() or df.pivot_table(..., aggfunc='sum') if they are grouping/pivoting on a categorical with missing categories. Previously the default would give them NaN for the missing categories, now the default will give them 0.

What is the appropriate to highlight/document this change to the default behaviour?

@smithto1 smithto1 marked this pull request as draft July 10, 2020 09:00
@smithto1 smithto1 marked this pull request as ready for review July 10, 2020 09:26
@jreback jreback added Categorical Categorical Data Type Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate labels Jul 10, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smithto1 I am not thrilled with threading this fill arg thru everything. any way to isolate this to just sum itself (e.g. pass in sem thru _agg_general)?

@smithto1
Copy link
Member Author

@jreback I've made another attempt using a different approach. Check out #35241

(I'm more thrilled with #35241 right now, so will mark this one as draft, expecting we'll decline it later.)

@smithto1 smithto1 marked this pull request as draft July 11, 2020 22:46
@smithto1 smithto1 closed this Jul 15, 2020
@smithto1 smithto1 deleted the issue31422 branch July 15, 2020 01:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: df.groupby().count() returns NaN instead of Zero Inconsistent behavior when groupby pandas Categorical variables
2 participants